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Abstract 

Stencil computations consume a major part of runtime in many scientific simulation 
codes. As prototypes for this class of algorithms we consider the iterative Jacobi and 
Gauss-Seidel smoothers and aim at highly efficient parallel implementations for cache- 
based multicorc architectures. Temporal cache blocking is a known advanced optimiza- 
tion technique, which can reduce the pressure on the memory bus significantly. We 
apply and refine this optimization for a recently presented temporal blocking strategy 
designed to explicitly utilize multicore characteristics. Especially for the case of Gauss- 
Scidcl smoothers we show that simultaneous multi-threading (SMT) can yield substantial 
performance improvements for our optimized algorithm. 

Keywords: stencil computations; spatial blocking; temporal blocking; wavefront 
parallelization; multicore; simultaneous multi-threading 



1. Introduction 

Stencil computations can be found at the core of many scientific and technical appli- 
cations based on regular lattices. For the important class of partial differential equation 
(PDE) solvers they are a key performance factor. This does not only hold for serial 
applications but is also true for massively parallel large scale multigrid PDE solvers (see 
e -g- [H ) j where the time-consuming smoothing steps are frequently composed of stencil 
computations such as red-black Gauss-Seidel or Jacobi schemes. 

It has been shown recently Q that for state-of-the-art multicore architectures a near 
to optimal stencil implementation requires elaborate tuning, even if more complex tempo- 
ral blocking techniques are ignored. Conventional temporal blocking performs multiple 
updates on a small block of the computational domain before proceeding to the next 
block. Apart from having a machine dependent tuning parameter, this kind of temporal 
blocking in three spatial dimensions has been found to not deliver performance improve- 
ments [6[ because it generates rather short loops resulting in substantial performance 
penalties from pipeline start-up effects and, even worse, from strongly restricted data 
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prefetching abilities. In [7[ it was shown that adapted variants of these temporal block- 
ing techniques, operating on whole lines instead of rectangular small blocks, show very 
good results on modern architectures also in 3D. Cache oblivious algorithms as proposed 
by Frigo ct al. Q arc hardware independent but come at the cost of irregular block 
access patterns, which cause many data TLB misses. This was shown for a 3D lattice 
Boltzmann (LB) application kernel in Ref. Q. For an overview about stencil algorithm 
specific optimizations we refer to Q ■ 

Recently a proof of concept for a wavefront-based shared memory parallelization 
scheme was first presented Q. It allows implicit temporal blocking for stencil computa- 
tions in multicore environments with shared caches. The basic idea is to run multiple 
wavefronts through the computational domain at the same time but appropriately shifted 
in space (depending on the stencil used). Each wavefront represents an update step of 
the lattice and is executed by a single thread. Binding all threads that run successive 
wavefronts for the same computational domain to a single multicore chip with a shared 
cache restricts data access to main memory to a load operation for the initial wavefront 
and a store operation for the final wavefront; all other (intermediate) data accesses can 
be satisfied from the shared cache. The scheme is tailored to multicore architectures with 
shared caches — which will be the major design principle for most standard architectures 
in the years to come — and does not exhibit the drawback of very short loop lengths. 

This paper introduces a more generic and flexible implementation of the wavefront 
technique with an improved spatial blocking scheme and highly efficient synchronization 
primitives. In addition to the application on Jacobi, already shown in we extend the 
method to the lexicographic Gauss-Seidel smoother, whose data dependiencies require 
substantial modifications in thread scheduling Q . Moreover we show how our technique 
can leverage SMT threads on Intel processors increasing the utilization of the floating 
point units. 

All modern x86-based compute nodes employ scalable ccNUMA architectures and 
thus we focus on single socket results only. The parallelization strategies used in this 
report are known to scale well on these architectures. 

2. Experimental Testbed 

A wide selection of modern x86-based multi-core processors (cf. Tab. [1} has been 
chosen to try different variants of the wavefront parallelization strategy and to demon- 
strate its performance potentials. All of these chips feature a large outer level cache, 
which is shared by two (Intel Harpcrtown), four (Intel Nehalem EP), six (Intel Westmerc 
EP, AMD Istanbul) or eight cores (Intel Nehalem EX). The maximum number of cores 
sharing an outer level L2/L3 cache will be denoted as "L2/L3 group" in the following. 
The Harpcrtown processor (implementing the Core 2 architecture) is usually considered 
as a quad-core chip since four cores are put in a single package (see Fig.[T](a)). However, 
it is built up from two independent L2 groups without a shared L3 cache and thus we will 
consider it as two independent dual-core processors, i.e. two L2 groups. The Nehalem 
EP is Intel's first quad-core chip featuring a shared L3 cache for all four cores (L3 group) . 
In addition a complete redesign of the memory subsystem allowed for a substantial in- 
crease in main memory bandwidth at the cost of introducing a ccNUMA architecture 
for multi-socket servers. The follow-on processor (Westmerc) reflects a shrink in transis- 
tor size, which allows to increase both the number of cores as well as the L3 cache size 
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by 50% (Fig. Q] (b)). Intel also reintroduced simultaneous multithreading (SMT) with 
the Nehalem architecture, a hardware optimization to improve utilization of execution 
units. Each core supports two SMT threads. AMD's competitor to Intel Nehalem is 
the Istanbul processor design, which comes as a six core L3 group and is based on a 
ccNUMA architecture on the multi-socket node level as well. The 8-core Intel Nehalem 
EX processor is not mainly targeted at HPC clusters but at large mission-critical servers. 
Since it already implements a substantially improved cache architecture and can easily 
be manipulated on the main memory bandwidth side we have included it in this report 
to simulate future architectural developments. A comprehensive summary of the most 
important processor features is presented in Tab. [T] As the two iterative schemes con- 
sidered in this paper are known to be memory bandwidth intensive we have reported in 
Tab. [T] the maximum attainable main memory bandwidth as measured by the STREAM 
triad benchmark ll| with and without non temporal stores. In the latter case the full 
bus traffic including the write allocate transfer for the store stream is reported. For a 
more detailed analysis of the memory and cache hierarchy see [lij ]. Please note, that 
our Intel Nehalem EX test system was equipped with only half of the possible number 
of memory cards, reducing the bandwidth accordingly. The ability of pinning a selected 
team of threads to a single cache group and determining the cache group topologies of 
a multi-core processor is vital for the parallelization approach described in this report. 
In this context we use the open-source tool likwid [l2[, which has been developed in our 
group. 
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Figure 1: Cache and thread topology of Intel Core 2 Quad and Nehalem EP Westmere 
processors. 



3. Iterative schemes and baseline performance 

Iterative schemes based on regular stencil computations in three spatial dimensions 
are used in many numerical applications, e.g. linear solvers or multi-grid methods. As 
prototypes for this class of algorithm we have chosen the well-known Jacobi method for 
a Poisson problem and the Gauss-Seidel method for a Laplace problem. For reason- 
ably large data sets those methods are known to be data-intensive and the attainable 
main memory bandwidth imposes an upper limit for performance, which can be rather 
accurately modeled 0, In this section we will first briefly introduce both schemes, 
determine an upper performance limit (via main memory bandwidth as given in Tab.[l|, 
and introduce in case of Gauss-Seidel a simple code optimization to achieve the expected 
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Microarchitecture 


Intel Core 2 


Intel Nehalem EP 


Intel Westmere 


Intel Nehalem EX 


AMD Istanbul 


Model 


Xeon X5482 


Xeon X5550 


Xeon X5670 


Xeon X7560 


Opteron 2435 


Clock [GHz] 


3.2 


2.66 


2.93 


2.26 


2.6 


Cores per socket 


4 


1 


6 


8 


6 


SMT threads per core 


N/A 


2 


2 


2 


N/A 


LI Cache 


32 kB 


32 kB 


32 kB 


32 kB 


64 kB 


Associativity 


8 


8 


8 


8 


2 


L2 Cache 


2x6 MB (shared) 


4x256 KB 


6x256 KB 


8x256 KB 


6x512 KB 


Associativity 


21 


8 


8 


8 


16 


L3 Cache (shared) 




8 MB 


12 MB 


24 MB 


6 MB 


Associativity 




16 


16 


24 


48 


Bandwidths [GB/s]: 


Theoretical socket BW 


12.8 


32.0 


32.0 


17.1 


17.1 


STREAM 1 thread 


4.6 


11.9 


11.0 


5.3 


7.2 


STREAM socket NT/noNT 


4.8/5.6 


18.5/23.7 


21.0/23.6 


9.1/13.6 


9.8/11.4 
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Figure 2: Stencil structure and corresponding mapping in memory space. 



performance number. Those optimal measurements will be the baseline to compare with 
for the multi-core aware parallelization approach. 

The Jacobi scheme in three spatial dimensions can basically be formulated as follows 
(we use C notation): 

for(iter=0; iter < it erEnd ; it er ++) { 
for(k=0; k<Nk; k++) { 
for ( j =0 ; j <Nj ; j ++) { 
for(i=0; i<Ni; i++) { 
dst [k] [j] [i] = a * sre [k] [j] [i] + b * ( 

sre [k] [j] [i-1] + sre [k] [j] [i + 1] + 
sre [k] [j -1] [i] + arc [k] [j+i] [i] + 
arc [k-1] [j] [i] + arc [k + 1] [ j] [i] ) ; 

} } > } 

Fig. [2] shows the basic update scheme of this kernel. The domain is decomposed into 
lines (y dimension) and planes (z dimension). The computational kernel updates one 
line at a time, and the seven point stencil in 3D is mapped to a memory access pattern 
with five streams, only one of which has to be loaded from memory if three planes fit 
in the outermost cache level (see Fig. |2] right). The store on the dst array generates 
an additional data stream if implemented using non temporal stores. We implement an 
optimized version of the innermost loop (line update kernel). This subroutine is used for 
all parallel variants which only modify the processing order of the outer loop nests. This 
ensures results comparable to the optimized variants in [jjj]. Fig. [3] (a) shows the serial 
baseline performance of the Jacobi kernel against a straightforward C implementation for 
the L2 /L3 cache group and main memory. The C code was compiled with Intel compiler 
ice 11.0 and standard optimization flags (-03 -xW -align -fno-alias). The same 
executable was used for all machines. Note that the C code could probably reach better 
performance by applying code transformations or including pragmas/intrinsics. Still this 
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(a) Serial (b) Socket 



Figure 3: Jacobi baseline for C language (C) and assembly (asm) kernels, comparing 
memory and outer-level cache performance. Domain sizes were chosen as 100x50x50 
(4 MB data set fitting in the outermost cache level) and 400x200x200 (256 MB data set 
to be loaded from main memory), respectively. For all machines all physical cores of a 
socket were utilized. 



comparison shows the effectiveness of the optimization techniques independent of their 
implementation level. For the memory domain a variant using non-temporal (streaming) 
stores was considered. As expected the highly clocked but bandwidth-starved Harper- 
town processor shows the largest drop between in-cache and main memory performance. 
The in-cache performance for the Nehalem variants is directly correlated with their clock 
speed. On Nehalem EP and Westmere the small drop between in-cache and main mem- 
ory performance shows that the serial Jacobi method is not primarily memory bandwidth 
limited on this machine. The Istanbul, despite its theoretically similar capabilities, shows 
a low performance, and there is no significant difference between optimized and C or in- 
cache and memory performance. The combination of exclusive caches and large cache 
latency overhead causes that a major part of the runtime has to be spent transferring 



within the cache hierarchy (cf. |14|). Therefore also the applied optimizations do not 
show a larger effect, and the actual processing of cachelines is only a small part of total 
runtime. In opposite all Intel processors show a high cache efficiency for these bandwidth 
demanding data access patterns. 

Turning to multi-threaded execution on a single socket one can assume that main 
memory bandwidth is saturated. Following above discussion, the minimum data transfer 
between main memory and cache hierarchy for a single cell update is one load and one 
store. As a performance measure we use the lattice site updates per second (LUP/s) 
metric. In this case a simple model for the maximum performance in terms of LUP can 
be set up (for double precision): 

p " = T^I Lup / 8 ' <>> 

The computer's attainable main memory bandwidth (Ms) can be measured e.g. with 
the threaded STREAM triad benchmark. This simple approach is known to provide a 
good upper performance limit for memory bandwidth limited situations. 
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We estimate the potential performance gain for temporal blocking by benchmarking 
the saturated L2 /L3 cache group performance with a dataset that fits in the outer cache 
level. The larger the difference between in-cache and main memory performance, the 
higher the expected performance improvement by temporal blocking. For a more elabo- 
rate prediction based on a diagnostic performance model, cf. [jjjj . Figure[3](b) shows the 
saturated L2/L3 cache group and saturated main memory performance, together with 
the limit predicted by (JlJ. Nehalem EP and Westmere show a more balanced perfor- 
mance between main memory and cache. Hence they are expected to benefit less from our 
optimizations. The measurements reveal that while Westmere clocks higher, the uncore 
(L3 cache and memory controller) has the same clock speed as Nehalem EP and there- 
fore reaches similar in-cache performance, despite its two additional cores. Nehalem EX 
introduces a novel segmented L3 cache which shows a near to perfect bandwidth scaleup 
with the number of cores. The memory results are well in line with the performance 
limits predicted based on the STREAM triad sustained bandwidth as listed in Tab. [TJ 

The Gauss-Scidcl method as opposed to Jacobi performs an in-place update. It can 
be formulated as follows (using C notation): 

for(iter=0; iter < iterEnd ; iter ++) { 
for(k=0; k<Nk; k++) { 
for ( j =0 ; j <Nj ; j ++) { 
for(i=0; i<Ni; i++) { 
src[k][j][i] = b * ( sre [k] [j] [i-1] + sr c [k] [ j ] [ i + 1] + 
sre [k] [j -1] [i] + sre [k] [j+1] [i] + 
sre [k-1] [ j] [i] + sre [k + 1] [j] [i] ) ; 

} } > } 

Gauss-Scidel looks similar to Jacobi at first glance in terms of stencil structure and 
data transfer volumes. A difference apparent at once is that SIMD vectorization is 
ruled out because of the recursive structure on the central line. The in-place update 
prevents optimal pipelining and the use of non-temporal stores as well. Thus Gauss- 
Seidel performance is inferior to Jacobi despite comparable data transfer volumes and less 
computations (Fig. [3] and S]). Additionally there is no substantial drop between in-cache 
and memory performance. This is especially remarkable on the Core 2 processor and 
clearly indicates that pipelining problems limit the achievable performance. To overcome 
these problems, our optimized kernel interleaves two updates in order to break up register 
dependencies and partially hide the recursion. The optimized version implements the 
following adjusted update scheme: 

for(iter=0; iter < iterEnd ; iter ++) { 
for(k=0; k<Nk; k++) { 
for ( j =0 ; j <Nj ; j ++) { 
tmpl = sre [k] [ j ] [2] + 

sre [k] [j -1] [1] + sre [k] [j+1] [1] + 
sre [k-1] [j] [1] + sre [k + 1] [j] [1] ; 
for(i=l; i<Ni-l; i++) { 
tmp2 = sre [k] [ j ] [i + 1] + 

sre [k] [j -1] [i] + sre [k] [j+1] [i] + 
sre [k-1] [j] [i] + sre [k + 1] [ j] [i] ; 
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Figure 4: Gauss-Scidcl baseline performance. Note that the C implementation does not 
employ the dependency optimization described in the text. Same data sets as in Fig. [3] 



src[k][j][i] = b * ( sre [k] [j] [i-1] + tmpl); 
tmpl = tmp2 ; 
> } } } 

Figure H] (a) shows the serial Gauss-Seidel results. A large part of the speedup between 
the optimized assembly implementation as compared to the C version can be attributed 
to this Gauss-Seidel specific code reordering. The small drop from in-cache to memory 
performance indicates that the sequential Gauss-Scidcl on Nchalcm EP and Westmere is 
not memory bandwidth limited anymore. Only on the bandwidth starved Harpertown 
there is still a significant drop between in-cache and main memory domain. Istanbul 
shows a much more competitive performance for the optimized code. The data transfers 
are still inefficient, but the optimizations can show their effect and the impact of the 
inefficient caches is smaller because now the LI runtime part is much larger. Westmere 
benefits from its two additional cores compared to Nehalem EP indicating that the L3 
bandwidth is not fully saturated by four threads. 

A further problem connected to the recursive nature of Gauss-Scidcl is that a straight- 
forward parallclization based on domain decomposition cannot be employed. A common 
solution is to use the Red-Black Gauss-Seidel method instead, which can be easily par- 
allelized. We chose another possibility to parallelize the standard lexicographic Gauss- 
Seidel method based on a pipeline parallel approach (see Fig. [5] (a)), which retains the 
same algorithm as a sequential Gauss-Seidel. Each thread operates on a sub-block. Plane 
updates of threads arc shifted in time to retain the correct update order. The threaded 
socket results for Gauss-Seidel are illustrated in Fig. SJ Compared to the Jacobi solver, 
the parallel Gauss-Seidel algorithm is still limited by the loop-carried dependencies in 
the kernel, which leads to a smaller performance difference between in-cache and mem- 
ory situations for all but the Westmere and Istanbul processors. Westmere can still 
hit the bandwidth limitations due to its larger core count, while Istanbul is known to 



be restrained by the large overheads for cache line transfers 14|, making the inefficient 
pipelining less dominant. 

As mentioned above, non-temporal stores cannot be applied. We therefore use the 
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Figure 5: Parallclization of the Gauss-Seidel algorithm by pipelied parallel execution (a) 
and the wavefront approach (b) 



STREAM triad measurements without non-temporal stores (Tab. [IJ in the performance 
model (JTJ) for Gauss-Seidel. 

4. Temporal Blocking through Multi-Core Aware Wavefront Parallelization 

Our wavefront parallclization technique implements implicit temporal blocking by 
utilizing the property of modern multicorc architectures that multiple cores share the 
outermost cache level. The grid is decomposed into blocks. A block is updated by 
a "thread group," consisting of a number of threads. Each thread in a thread group 
performs one sweep on the block, successively updating the "planes" in z direction. 
Planes updated by consecutive threads are guaranteed to be still located in the shared 
cache. This update mechanism is illustrated in Fig. [6] Because multiple updates are 
performed while holding the data in cache and the intermediate update steps need not 
be stored, the second grid of the out-of-place Jacobi method is not required. Odd- 
numbered updates are instead written to a temporary array, which is large enough to 
hold the intermediate steps of all updates until the last update is written back to the 
sre array. In step (a) the first thread of a group performs the first plane update, loading 
from sre and writing to a small temporary array. With a distance of two to ensure the 
correct update order, the second thread performs the second update from the temporary 
array back to the sre array in step (b). In our example with four threads another 
two updates are performed until the fourth update is written to the sre array. The 
temporary array is used in a round robin manner, and hence shifted through the z 

9 



■ First Update 
[| Second Update 
□ Third Update 




(c) (d) 
Figure 6: Temporal wavefront blocking (1 thread group with four threads) 

dimension of the spatial grid. It must be large enough to hold the needed dst planes 
of all threads, for our example eigth. The threads have to be synchronized after each 
plane update, and block sizes must be chosen so that the temporary data can be kept 
in the outermost cache level. Since the number of thread groups, the number of threads 
per thread group and the block size can be freely configured this scheme can map to the 
underlying hardware in a very flexible way. A drawback of the current implementation is 
that the maximum number of blocked updates is determined by the number of threads 
available. This method requires fine-grained parallelism, making it crucial to employ 
an efficient synchronization mechanism. Our threaded implementation is based on the 
POSIX thread API. The pthread barrier turned out to have a very large overhead, making 
it unsuitable for fine-grained parallelism. For small thread counts as applicable on a single 
socket, an implementation of a spin waiting loop was used for the barrier. Since this does 
not perform well with SMT threads, a tree barrier was implemented which provided less 
overhead whenever more than one logical thread per core was used. 

The spatial blocking scheme is illustrated in Fig. [7] Every thread group performs a 
parallel wavefront update as explained above. The domain is decomposed into B blocks 
along the y dimension (8 in [7]). Each thread group works on one or more blocks. All 
thread groups update each block in a synchronized fashion, and one z sweep is performed 
on the first N blocks, where N is the number of thread groups. The boundary between 
consecutive sweeps must be set up so that the next sweep can proceed with correct 
boundary conditions on the interface to the previous sweep. If t is the number of threads 
in a group, a boundary array must thus hold t planes in z-x direction. Hence no additional 
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Figure 9: Wavefront temporal blocking results for the Gauss-Seidel smoother 
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Figure 10: Wavefront temporal blocking results for the Gauss-Seidcl smoother with SMT 



computations are necessary for the boundary treatment. 

Figure [5] shows the results for Jacobi with temporal wavefront blocking on one socket. 
The baselines drawn on the right are the threaded results without temporal blocking for 
a problem size of 200x200x200. On Core 2 there is a very large gap between in-cache and 
in-memory memory baselines, indicating a considerable potential for temporal blocking. 
A speedup of two could be achieved by wavefront temporal blocking, but to leverage the 
potential on this architecture a bigger blocking factor (i.e., more cores per thread group) 
would be required in order to decouple from main memory bandwidth. A completely 
different picture can be seen on the Nehalem EP. Here the gap between in-cache and 
memory performance is rather small. The threaded memory performance utilizing non- 
temporal stores is already 1008 MLUPS. While a speedup of 25-50% seems fair, the 
scaling of memory bandwidth with the number of cores results in a high baseline and 
limits the benefit of our optimization. The result on the Westmere processor is similar, 
still there is a speedup benefit from the higher blocking factor of six. The Intel Nehalem 
EX in the configuration used here shows the highest benefit. It allows a blocking factor of 
eight together with a very high bandwidth L3 cache. On the other hand, this (artificial) 
configuration is strongly bandwidth-starved in terms of available bandwidth per core. 
This combination results in speedups of four independent of the problem size. Istanbul 
has a good initial position promising a significant speedup, there is a large gap between 
saturated in-cache and main memory performance, and the number of cores per socket 
allows a blocking factor of six. However, it only achieves speedups comparable to Nehalem 
EP. 

Our scheme for temporal wavefront parallelization can be adapted for the in-place 
Gauss-Seidel method if used in combination with the pipeline parallel approach. Since all 
updates operate on one array, additional temporary arrays are not required. To ensure 
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the correct update ordering the updates have to be shifted for lexicographic Gauss-Seidel 
between thread groups as illustrated in Fig. [5] (b). This is a natural extension to the 
threaded pipelined parallelization introduced in Sec. EH The results for temporal wave- 
front blocking with Gauss-Seidel arc shown in Fig. [S] Again the baseline of a threaded 
Gauss-Seidel with pipeline parallelization for a size of 200x200x200 is indicated on the 
right y axis. On the Core 2 the combination of low (and non-scaling) main memory 
performance together with only a blocking factor of two leads to a speedup of nearly 
two for temporal blocking. Nehalem EP shows improvements of 30-40%. Westmere 
profits from its two additional cores and hence from the deeper blocking factor, reaching 
a speedup of over 50 %. The potential of our optimization technique is again seen on the 
example of Nehalem EX. Despite its hardware configuration yielding the lowest band- 
width compared to the other Nehalem processors, it reaches the best overall performance 
and can fully benefit from its eight cores and strong L3 cache subsystem, showing an 
impressive speedup of 3.8. The Istanbul architecture again shows disappointing results 
comparable to the Nehalem EP. Its exclusive cache hierarchy seems to be unsuited for 
these bandwidth-demanding in-cache loops, but the exact reason for the small gains of 
our optimization were not investigated in more detail. 

As noted in Sec. [3j Gauss-Seidel cannot be fully pipelined due to its recursive struc- 
ture. While our optimized kernel reduces this penalty, it is still noticeable, causing the 
floating point units to be underutilized. For exactly this situation, a shared resource be- 
ing not fully used, the Nehalem processors implement SMT with two hardware threads 
sharing a physical core. Further possible benefits of using SMT threads for our tempo- 
ral wavefront optimization are manifold, deeper blocking factors being just one effect. 
Since main memory bandwidth on the Nehalem (EP and Westmere) processors scales 
with the number of threads, the utilization of the memory bus can be increased by us- 
ing multiple thread groups. And finally, two SMT threads also share the LI and L2 
caches, potentially reducing necessary cacheline transfers in our pipelined wavefront set- 
ting. Figure [T0l shows the results, with filled symbols denoting SMT data. Nehalem EP 
and Westmere now achieve speedups of 2.5 versus the threaded baseline. The SMT bene- 
fit on Nehalem EX is not that large, which could be caused by the fact that this chip was 
already arithmetically limited. An indication supporting this conjecture is that Nehalem 
EP Westmere and Nehalem EX now reach a comparable performance in accordance to 
their similar arithmetic peak performance per socket. Nehalem EX can reach an overall 
speedup of up to five against its threaded baseline. 

5. Conclusion 

Wc have presented a novel way to implement temporal blocking, specifically designed 
to leverage the shared outer-level cache on today's multicorc architectures. The opti- 
mizations were evaluated on a wide range of current multicore processors to show their 
potential. Besides the well-known Jacobi method we have presented a highly efficient 
implementation of the recursive Gauss-Seidel method. For the threaded and tempo- 
ral wavefront implementation of Gauss-Seidel wc employed a pipeline parallel approach, 
retaining the update ordering of the serial algorithm. The optimization reaches con- 
siderable speedups on all architectures. Results on our (artificially) bandwidth-starved 
Nehalem EX system confirm that a large ratio between in-cache and memory bandwidths 
improves the gain for our temporal blocking approach. The cache subsystem of the AMD 
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Istanbul turned out to be incapable of benefitting from our optimizations to the same 
extent as comparable Intel architectures. Finally we have shown that employing SMT 
threads for temporal blocking of the Gauss-Seidcl solver yields large performance im- 
provements with overall speedups of up to five on Intel Nehalem EX. It is noteworthy 
that for this case both Nehalem EP and Nehalem EX reach their full arithmetic potential 
independent of their different memory bandwidth capabilities. 



Acknowledgment 

We arc indebted to Intel Germany for providing test systems and early access hard- 
ware for benchmarking. 



References 

B. Bergen, F. Hulsemann, U. Rude: Is 1.7xl0 10 Unknowns the Largest Finite Element System 
that Can Be Solved Today? In: ACM/IEEE (Ed.): Proceedings of the ACM/IEEE SC 2005 
Conference (Supercomputing Conference '05, Seattle, Nov 12-18, 2005). 

G. Hagcr, F. Deserno, G. Wellein: Pseudo-Vectorization and RISC Optimization Techniques for 
the Hitachi SR8000 Architecture In: High Performance Computing in Science and Engineering 
Munich 2002 (editor S. Wagner ct al.), Springer, pp. 425-442, 2003 

K. Datta, M. Murphy, V. Volkov, S. Williams, J. Carter, L. Oliker, D. Patterson, J. Shalf, K. Yelick: 
Stencil Computation Optimization and Auto-tuning on State- of-the- Art Muticore Architectures. 
In: ACM/IEEE (Ed.): Proceedings of the ACM/IEEE SC 2008 Conference (Supercomputing 
Conference '08, Austin, TX, Nov 15-21, 2008). 

K. Datta, S. Kamil, S. Williams, L. Oliker, J. Shalf, K. Yelick: Optimization and Performance 
Modeling of Stencil Computations on Modern Microprocessors. In: SIAM Rev.: vol. 51, No. 1, 
pp. 129-159, 2009. 

G. Wellein, G. Hagcr, T. Zciser, M. Wittmann and H. Fehske: Efficient temporal blocking 
for stencil computations by multicore- aware wavefront parallelization. Proc. COMPSAC 2009. 
DOI: 10 . 1 109/COMPSAC . 2009 . 82 

M. Kowarschik: Data Locality Optimizations for Iterative Numerical Algorithms and Cellular 
Automata on Hierarchical Memory Architectures. PhD thesis, July 2004, SCS Publishing House, 
Germany. ISBN 3-936150-39-7. 

J. Treibig: Efficiency Improvements of Iterative Numerical Algorithms on Modern Architectures. 
PhD thesis, July 2009, URN: urn:nbn:de:bvb:29-opus-14036. 

M. Frigo, C.E. Leiserson, H. Prokop, S. Ramachandran: Cache-Oblivious Algorithms. In: 40th 
Annual Symposium on Foundations of Computer Science, FOCS 99, Oct 17-18, 1999, New York, 
NY. 

T. Zeiser, G. Wellein, A. Nitsure, K. Iglberger, U. Rude, G. Hager: Introducing a parallel cache 
oblivious blocking approach for the lattice Boltzmann method. Progress in CFD, vol. 8, No. 1-4, 
pp. 179-188, 2008. 

D. J. Kerbyson, A. Hoisie: Analysis of Wavefront Algorithms on Large-scale Two-level Heteroge- 
neous Processing Systems. In Proc. Workshop on Unique Chips and Systems (UCAS2), IEEE Int. 
Symposium on Performance Analysis of Systems and Software (ISPASS), Austin, TX, 2006. 
J.D. McCalpin: STREAM: Sustainable Memory Bandwidth in High Performance Computers. 

http://www.cs.virginia.ed u/streaml 

LIKWID tool suite, http://codc.google.eom/p/likwid/ 

M. Wittmann, G. Hagcr, G. Wellein: Multicore- aware parallel temporal blocking of stencil codes 
for shared and distributed memory. Accepted for LSPP10, the Workshop on Large-Scale Parallel 
Processing at IPDPS 2010, April 23rd, 2010, Atlanta, GA. larXiv:0912. 45061 
[14] J. Treibig and G. Hager: Introducing a Performance Model for Bandwidth Limited Loop Kernels. 
Workshop on Memory Issues on Multi- and Manycore Platforms, PPAM2009. 



15 



Gauss Seidel 



320 b 



384 b 



128 b 



128 b 



Core 






A 


Ll 


M 




w 


L2 


1 


V 


L3 


) 





36 cycles 

per double update 

20 b/cycle per unknown 



12 cycles 

32 b/cycle 
4 cycles 

32 b/cycle 

5.3 mem cycles 
= ca. 13 cycles 

24 b/mem cycle 



MEM 



Core 

mm 



LI 

am 



320 b 



L2 

am 



320 b 



192 b 



L3 

u 



27 cycles 
per cacheline 

19.2 b/cycle 



10 cycles 
32 b/cycle 
10 cycles 

32 b/cycle 

8.0 mem cycles 
= ca. 20 cycles 

24 b/mem cycle 



MEM 



800 




t 



t 



.2/L3 
/lemc 



Threadgroup 



I ] Threadgroup 1 
| | Threadgroup 2" 
| | Threadgroup 3 



Temporary Array 



Is 



Temporary Boundary Array 



Group N-l Block Group Block 




source Array 



thread 1 



Updates: □ l □ 2 [J 3 Q 4 





tmp field + boundary fiel 



Updates: □ 1 □ 2 □ 3 □ 4 





tmp field + boundary fiel 



Updates: □ 1 □ 2 □ 3 □ 4 





tmp field + boundary fiel 



Updates: □ 1 □ 2 □ 3 □ 4 





tmp field + boundary fiel 



Updates: □ 1 □ 2 [J 3 □ 4 





tmp field + boundary fiel 



Updates: □ 1 □ 2 □ 3 □ 4 




Updates: |_| 1 "] 2 ~] 3 □ 4 




